Original posting - 02/21/92
This posting - 6/18/93

This document is intended to address four issues. First, why the big excitement about disk arrays? What do you actually get when you spend gigabucks on one of these boxes? Second, how can we put gigabytes of high performance, high reliability, hot swappable disk in our LAN servers without spending gigabucks? Third, how can we be sure that disk duplexing/mirroring really works, without jerking the power cord out of the disk drive and risking its destruction? The third issue is a necessary by-product of the second: if the drives are going to be hot swappable, we should be able to swap them without crashing the server or destroying the drives. This gives us an easy way to test the mirroring logic built into NetWare. Fourth, what do all these buzzwords mean?

Please address any comments to: Les Lovesee, 76702,1167

WHAT IS A DRIVE ARRAY, AND WHY SHOULD I CARE?

Not surprisingly, there are a lot of vendors claiming to offer a drive array solution. Also not surprising is the fact that many of these 'new' products are the same old products with the words 'drive array' or 'RAID X' tacked onto the product name. Because of this, I'm going to start with an overview of current array-like options.

At the top, we have the totally independent subsystems like those available from Sanyo/Icon. These units have their own processor(s) (both RISC-based and 680x0 versions, single and dual processor systems available), lots of RAM (64mb or more is common), and support for RAID levels 0, 1, and 5. You are not limited to a single RAID approach; some of the drives can be configured as duplexed, some as RAID 5, etc. A unit connects to up to twelve hosts, which can be a mix of NetWare servers, Vines servers, OS/2 or Windows NT machines, or Unix hosts. Connection is via independent SCSI-2 channels. Total disk storage can exceed 140 gigabytes, and prices go well into the six figure range.
At the next level, we have subsystems which connect several drives to a proprietary controller. This controller then connects to a standard SCSI bus and appears as a single drive. A parity drive is used to achieve fault tolerance. Most of these systems support hot swapping of drives, although it is easier in some than in others. This approach has the advantage of low impact on the server; that is, a proprietary host adapter isn't necessarily required, although the vendor sometimes offers his own high performance SCSI adapter. The actual drives used in these subsystems are sometimes ESDI, but more often SCSI. Vendors such as Ciprico and CORE offer systems that fall into this category.

Another approach is to use a proprietary controller/adapter and driver combination in the server. Drives are pretty much standard, and fault tolerance is parity based. Some of these systems support hot swapping, some do not. A good example of this is Compaq's IDA.

A purely software-driven approach uses standard controllers and drives, and depends on the server (or host) CPU to perform the RAID operations.

Finally, there are several vendors attempting to 'cash in' on the drive array fad with various packaging approaches. These systems don't offer anything really new in the way of fault tolerance or performance; most of them are just the same old drives and controllers in new sheet metal.

There are a couple of important points to recognize here. First of all, duplexing is the Cadillac of fault tolerance. I don't think anyone will argue that duplexing provides the highest level of fault tolerance, and the best overall performance. Parity based systems offer few real advantages (at this time) for the small to medium server - the biggest theoretical advantage is cost, but many parity based arrays actually cost much more than buying the raw drives and doing your own duplexing, since duplexing comes 'free' with NetWare.
Another potential advantage of arrays is slot savings - duplexing takes two slots, minimum, and if you start piling up drives you can hit the SCSI limit of 7 devices per channel pretty quickly. In other words, if the biggest drive you can put on your SCSI adapter is 2 gb, then you can't really put more than 14 gb on a single SCSI channel, which means you'd be hard pressed to put more than 28 gb of duplexed disk in a single server (4 host adapters * 7 drives of 2 gb each = 56 gb raw, or 28 gb once duplexed). Arrays generally don't suffer from this problem, because the fact that there are multiple disk drives in the box is hidden from the server - each array usually appears as a single drive, regardless of how many physical drives it contains.

"But what about speed? Aren't arrays supposed to be faster?" Certainly, some array configurations offer potential for increased thruput. However, if the drive array in question attaches to a standard SCSI host adapter, it's still limited by the speed of the SCSI bus. If you take the same number of standard SCSI drives and attach them to a good host adapter and set NetWare to span a single volume over those drives, you'll probably get better performance than you would get with a RAID 3 or 5 configuration. On the other hand, if the array in question uses some kind of proprietary channel or has 100mb of internal cache (like the Sanyo/Icon subsystems) it's gonna blow the doors off of any combination of standard drives attached via a standard SCSI channel. It's also going to COST a lot more.

Then there is the data integrity issue - some high-zoot database server programs are unable to maintain transaction integrity when disk writes are cached. NetWare provides a raw I/O mechanism to get around this problem, but if the cache is hidden from NetWare out in the RAID subsystem, then the DBMS has no way of dealing with it.

The general release of SFT III raises some additional questions.
One might think that since you have to duplex the entire server anyway, the question of whether to duplex or 'raid' the drives is answered. Instead, whole new questions are raised. For example, it is still possible to attach RAID subsystems to each server, which gives you an increased level of fault tolerance - i.e., the same drive can fail in both subsystems and the server continues operation. I question the wisdom of this, because it seems very unlikely that the exact same drive in the mirror subsystem will fail. Rather, it seems much more likely that a second drive in the same subsystem will.

For example, compare mirrored servers with 4 drives each to mirrored servers with 5 drives each in a RAID 3 or 5 configuration. If a single drive on server A fails, there is only a 1 in 7 chance that the next drive to fail will be the drive mirrored to the failed drive (seven drives remain, and only one of them is the failed drive's mirror). In the RAID scenario, there is a 4 in 9 chance that the next drive to fail will bring down the server A array (nine drives remain, and four of them are in the same array as the failed drive). This doesn't appear to be a problem because the other server continues to operate, but what happens when we bring the array back online? THE ENTIRE ARRAY HAS TO BE REMIRRORED! This could take quite a while, especially with RAID 5 since write operations are significantly slower. On the other hand, if RAID had not been implemented, then only the drives which actually failed would have to be remirrored. It's going to take someone with a background in statistical analysis and a thorough understanding of SFT III to put this question to rest.

Summary: In my humble opinion, if you are building a new server which isn't expected to grow beyond 5 gb or so anytime soon, your best performance/fault tolerance combo is to stick with NetWare duplexing and span the volumes over multiple drives. If you need (or soon expect to need) more than 20 gb on a single server, you need to seriously consider one of the pre-packaged array products. Between 5 and 20 gb is sort of a grey area.
The best solution for you is going to depend on your budget, your comfort level with your integrator (or self confidence if you are 'rolling your own'), and your performance and fault tolerance needs.

HOW CAN I HOT SWAP MY DRIVES WITHOUT DESTROYING THEM OR MY POCKETBOOK?, and DOES DUPLEXING REALLY WORK?

Disclaimer: I am presenting this information as a public service. Since what I am presenting is a strategy more than a product, and since I am receiving no compensation for doing so, I cannot be held responsible for whatever you do with this information. I have tested this approach, and feel confident that it works reliably. You should do the same before you place any system you derive from this into production status.

There are several requirements to be met if you want to hot swap your drives.

1. You must have some kind of redundancy; otherwise, what good is a hot swap? Typically, this will be NetWare mirroring or duplexing, or some kind of software RAID scheme.

2. You must have a way to power off each drive individually, and the drive must remain grounded. Some drives use the spindle motor as a generator to get the heads parked, etc., after power loss. This can result in some weird voltages appearing on the SCSI bus data lines while the drive is winding down IF you just jerk the power connector out. This is prevented if the drive remains connected to system ground when power is removed.

3. You must be able to physically remove the drive without having to use a screwdriver. What happens if you drop a screw on the motherboard or onto the logic board of another drive? Poof!

4. Your SCSI channel must remain terminated regardless of which drive is removed. The other drives on the channel won't work otherwise.

5. Your SCSI host adapter and driver must be able to deal with a drive failure. Obviously, if the adapter/driver can't deal with a drive failure in a graceful manner, duplexing isn't going to do you any good anyway.

Item 1 is taken care of by NetWare duplexing/mirroring.
Items 2 and 3 are easily handled by a device called a 'DataPort', manufactured by Connector Resources Unlimited in Milpitas, CA. Item 4 is handled by a standard SCSI terminator plug, which connects to a standard SCSI cable connector (the one that looks like a giant Centronics printer connector). This part is widely available. (I have not yet located a SCSI terminator which connects to a ribbon cable. This would be the ideal for smaller systems, so if anyone knows where to get one of these, I'd appreciate an Email note.)

The 'DataPort' consists of two mating bracket components. The outside component is made of metal, and screws into a half height 5 1/4" drive opening. It has a standard power connector and SCSI ribbon cable connector on the back, just like most SCSI drives. On the right front is a key switch which turns the power on and off, and locks the drive in place. The second part is made of plastic, and slides into the metal part. You install a 3 1/2" drive in the removable plastic bracket, plug in the provided connectors, and you're in business.

What do you put this into? If you're building a server, look for a chassis with lots of exposed drive bays. If you need more drives than that, there are SCSI drive subsystem boxes with standard connectors and power supplies available from several sources (including Connector Resources Unlimited). If you're on a budget, consider recycling old XT cases - replace the power supplies with Turbo Cool units, yank out the motherboards, and presto, instant 4 drive subsystem case. You'll need a couple of brackets for mounting the upper drive. I've purchased these from Jameco in the past, and I'm pretty sure that CRU has them as well.

How well does this actually work? Well, I put two brackets and drives in our office fileserver and duplexed them to each other. I then started a batch job on several workstations copying files around on this volume. I ran this for thirty minutes. While this was running, I followed this procedure:

1.
Power off a drive, and remove it.
2. Wait for the server to recognize that the drive has failed.
3. Reinsert the drive and power it up.
4. When the drive is ready, reactivate it using the disk info option of the monitor screen.
5. NetWare automatically begins remirroring.
6. When NetWare signals that remirroring is complete and the drives are in sync, power down the other drive.
7. Repeat steps 2-6.

In my case, the test was even more brutal, as one drive was SCSI, and the other IDE (DataPort is available in both SCSI and IDE flavors). When the IDE drive was powered down, all activity paused for the 5 seconds or so it took for the server to recognize the drive failure. (No, I'm not advocating that you should duplex IDE and SCSI drives, I just wanted to prove to my own satisfaction that it would work.)

DEFINITIONS

There are several concepts which are combined in different ways by different vendors. Each is described by a name. I will define these terms as I understand them, but keep in mind that not all vendors mean the same thing when they use these terms; there is no ANSI standard.

Asynchronous data transfer. When used in the context of disk subsystems, refers to a data transfer method which requires an acknowledgement for each byte or word of data transferred. Standard SCSI uses asynchronous data transfer.

ATA. See 'IDE'.

Channel. An intelligent path thru which data flows between host RAM and some kind of I/O controller, independent of the host CPU. This is a mainframe concept which has migrated down to the PC world. Under NetWare, a channel is any kind of disk controller or adapter; some are more intelligent than others.

Controller. The electronic component which translates between the sequential bit stream used by the disk drive read/write electronics, and the parallel digital data which the host computer requires. On IDE and modern SCSI drives, the controller is built into the drive itself.
In the case of MFM and ESDI drives in a PC, the controller is usually part of the interface card which plugs into the PC's bus. IDE and SCSI host adapters are often mistakenly called controllers.

Disk Array / Drive Array. Typically, any time several drives are put into one box, it is called a Disk Array. A more correct definition would be any time several drives are put into a box and attached to a controller or interface which makes them 'look' like a single drive to the host.

Duplexing. In this context, duplexing means keeping the same data on two or more drives. It differs from mirroring in the NetWare environment by providing a unique channel for each copy of the data.

ESDI - Enhanced Small Device Interface. An improvement over the original ST506 disk interface that allows for higher data transfer rates.

Host. This is the computer responsible for the data. Dumb terminal / timesharing users call their host a minicomputer or mainframe. Distributed processing / LAN users call their host a fileserver or just a server. Either way, it's the place where the data resides in the enterprise.

Host Adapter. This is a plug in card which provides an interface between the host and some kind of external data transfer bus. The most common host adapters connect a host bus to SCSI peripherals. An IDE adapter is sometimes referred to as a host adapter, although it is little more than a bus buffer.

Hot Swap. This refers to the ability to remove and insert components (drives in this case) from a computer or subsystem without interruption of service. You've got to be real careful about this one. While most vendors mean that you can replace a failed drive while the array is running, at least one vendor requires you to shut down the entire array, but calls this a hot swap because the server is still running!

IDE - Integrated Drive Electronics. Also referred to as ATA (AT Attachment). This drive interface emulates a standard IBM AT compatible MFM controller at the control register level.
This means that drivers which work with a standard AT disk controller will almost always work with an IDE drive. Since all the controller electronics have been moved to the drive itself, only a very simple 'paddle board' with bus drivers and minimal address decoding is required to connect an IDE drive to the host bus.

Mirroring. In the NetWare environment, this is the process of keeping a second copy of the data on a second drive attached to the same 'channel'. It is a subset of duplexing.

Paddle Board. See IDE.

Parity. Generally refers to any kind of error detection and/or correction scheme, even if it is not really parity based. In the case of drive arrays, one drive is often referred to as the parity drive. It contains data which can be used to re-create the data on any one of the other drives, given the continued proper operation of all the rest. For example, you might have a five drive array where one drive is designated as the parity drive. If any of the other four drives fails, the data from the remaining three drives and the parity drive can be used to re-create the data on the lost drive. Once the failed drive is replaced, the drive array controller will rebuild that drive using the parity drive and the other functioning drives. Usually, some kind of exclusive OR scheme is used, and there is always some loss of performance associated with the rebuild process. Also, if you have already lost one drive, and another fails before the rebuild completes, you're hosed, unlike mirroring or duplexing.

RAID - Redundant Arrays of Inexpensive Disks. The theory is that instead of buying gigantic, monolithic drives (SLEDs), you combine multiple smaller drives to save money and to increase performance and reliability. In practice, many RAID systems cost more and are not as fast as the SLEDs they replace. There are several types of RAID; each is identified by number.
RAID 0 incorporates data striping only. This results in a performance improvement, but no increase in reliability; in fact, just the opposite. If any drive in the RAID 0 array fails, all data is lost.

RAID 1 is mirroring. That is, each data drive is actually a pair of drives containing the same data. This increases reliability, and can increase performance if separate channels are used for the primary and mirror drive.

RAID 2 uses Hamming codes for error detection/correction; these are interleaved right in the bit stream. This sounds like a good idea, but Hamming codes are designed to not only correct the bad data, but to also identify the failed bit. Disk drives already have mechanisms for detecting that a failure has occurred, so this results in unnecessary additional storage overhead.

RAID 3 was initially the most common 'array' technique (after mirroring), but has become less popular. It uses an even number of data drives (usually 4) and a dedicated parity drive. Each block that is to be written is divided up among the data drives, a parity block is generated (usually by XORing the data blocks), and all blocks including the parity block are written out to the drives simultaneously. Spindle synchronization is usually used to improve thruput. This RAID flavor is good at reading and writing LARGE amounts of data.

RAID 4 supports an even or odd number of data drives. Rather than dividing a block of data between all drives, it writes individual blocks to individual drives. Error correction info is generated and stored on a dedicated parity drive. Although this allows independent parallel reads of smaller chunks of data, it necessitates a 'read-modify-write' cycle to the parity drive for each data block written. This is the main drawback of this technique. The primary advantage is the independence of the drives; supposedly, interactive or transaction based applications will run faster with RAID 4 than RAID 3.

RAID 5 is probably the most popular technique in use today.
It is similar to RAID 4, but rather than using a dedicated parity drive, the parity information is interleaved among all the drives. A 'read-modify-write' cycle is still required for disk writes, but because of the way parity is distributed, it is often possible to perform multiple simultaneous writes. For example, if you want to write block 27 to drive 0, and block 500 to drive 2, it may work out that the associated parity blocks are on drives 1 and 3. This allows both writes to occur 'simultaneously', unlike RAID 4 where all parity is on the parity drive, creating a write bottleneck.

Once you get beyond the original six levels (0-5) you enter a twilight zone where the definition of a particular RAID level depends on who you are talking to.

RAID 6 has at least three definitions. The first is an attempt to solve the multi-drive failure problem. All of the RAID schemes which depend on parity rather than redundancy are unable to deal with more than a single drive failing. In RAID 6, an extra copy of each parity block is written to a different drive. This improves fault tolerance at the price of increased storage overhead. The second definition was coined by AST - it refers to their combination of RAID 0 and 1. Hmm, sounds like NetWare spanning and mirroring to me. Finally, I have read a couple of articles where a combination of caching with other RAID levels is called RAID 6.

RAID 7 is used by two different vendors to denote different products. Pace Technologies, Inc., uses the term RAID 7 to denote what is essentially a hot spare approach added on to RAID 0, 3 or 5. Another vendor, Storage Computer Corp. (StorComp), uses the term as the name of their patented disk subsystem. Here is what they say about it: "RAID 7 incorporates the industry's first asynchronous hardware design to move drive heads independently of each other, and a real-time embedded operating system to control access to the disk drives and data flow to the host interfaces.
As a result, RAID 7 platforms exceed single-spindle performance for all four basic disk metrics of small reads and writes, and large reads and writes, in any combination." (EDGE: Work-Group Computing Report, Nov 13 1992, v3 n130 p31) Hmm, I thought RAID 4 and 5 operated the drives independently of each other, at least in read mode. Another key feature which is unique is the ability to use varying numbers of drives of different sizes in the array.

Scattering. When spanning is used under NetWare, blocks of data are scattered among the drives which have been spanned into a single volume. This differs from striping in that complete blocks are written to each drive, rather than dividing one block up between several drives. It's easier to implement from a hardware standpoint because it doesn't require spindle synchronization, but may not provide all the potential thruput gains of striping. Note that it can provide SOME performance advantages because each drive still has its own controller, and can position its heads, read, write, etc. independently from the other drives.

SCSI - Small Computer Systems Interface. Pronounced 'SCUZZY'. This is an 8 bit data channel for connecting disk drives, tape systems, and other block oriented devices to a host system. It supports a maximum of one 'master' (the host) and seven 'slaves' (the peripherals). In addition to standard SCSI (SCSI-1), there is SCSI-2, which supports synchronous data transfer at higher data rates, and SCSI-WIDE, which expands the 8 bit data path to 16 or 32 bits. The SCSI standard evolved from the old SASI (Shugart Associates Standard Interface) bus.

Segment. This is the smallest read/write unit supported by some types of RAID. It may be user configurable so that the best performance for a particular application can be obtained. A segment has much in common with a NetWare block.

SLED - Single Large Expensive Disk - opposite of RAID.

Spanning.
This is the process by which NetWare allows you to combine multiple physical drives into a single volume. It is software driven rather than hardware driven, but doesn't use much in the way of server resources.

Spindle Synchronization. This means just what it says, that is, keeping all the spindles on all the drives synchronized so that sector zero goes by the heads on all the drives at the same time. This is done to improve thruput when data is striped across multiple drives. Not all drives have this capability, and it usually requires some external hardware even for drives which support it.

Striping. (pronounced with a long I) This is the process of dividing a single logical block into multiple physical blocks on multiple disk drives. The idea behind this is increased thruput by dividing the data up among multiple parallel data paths, the implication being that each drive has its own controller which can operate in parallel with the other drives' controllers. Often this is done as part of a parity scheme. Spindle synchronization is required for true striping to work most efficiently.

Synchronous data transfer. When used in the context of disk subsystems, refers to a data transfer method in which each byte or word is 'clocked' from the sender to the receiver at a fixed rate. This is faster than asynchronous data transfer because the time used to acknowledge each byte or word is eliminated.

XOR. Short for eXclusive OR, this binary logic function is sometimes used to generate parity or error correction data. Arithmetically, it is an add without carry between bits. This is the XOR truth table:

    A in   B in   Out
     0      0      0
     0      1      1
     1      0      1
     1      1      0

Here is an example of how XOR might be used to generate a parity block, and recover from a lost block. Assume four data drives and one parity drive. Assume the logical block size is 4k; this is the default for NetWare. Divide the 4k block into four 1k segments.
Perform an XOR across the segments, generating a parity block, or in other words, XOR the first bit of each 1k block, producing the first bit of the parity block, repeat for each bit of data. When finished, write the four 1k data blocks out to the data drives, and the 1k parity block out to the parity drive.

To reconstruct data from a failed drive, read the 1k data blocks from the three remaining data drives, and the 1k parity block from the parity drive. XOR the first bit of each 1k block including the parity block, the result is the first bit of the missing block. Repeat for each bit of the missing block.

Here is an example using a 16 bit word to illustrate every possible combination of 4 bits. (thanks to my trusty HP16C for helping me get it right)

    Write original data          Read data missing drive 2

    0 - 0101010101010101         0 - 0101010101010101
    1 - 0011001100110011         1 - 0011001100110011
    2 - 0000111100001111         P - 0110100110010110
    3 - 0000000011111111         3 - 0000000011111111
    --------------------         --------------------
    P - 0110100110010110         2 - 0000111100001111

    Read data missing drive 0    Read data missing drive 1

    P - 0110100110010110         0 - 0101010101010101
    1 - 0011001100110011         P - 0110100110010110
    2 - 0000111100001111         2 - 0000111100001111
    3 - 0000000011111111         3 - 0000000011111111
    --------------------         --------------------
    0 - 0101010101010101         1 - 0011001100110011

It becomes obvious as you study this approach that it isn't limited to four data drives and one parity drive.
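
For readers who would rather check this with a program than a calculator, here is a minimal Python sketch of the same XOR scheme. It is only an illustration of the arithmetic, not how any particular array controller does it. The function names (split_block, make_parity, rebuild_missing) are made up for this example, and it assumes the block length divides evenly by the number of data drives.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must be the same length")
    # zip(*blocks) yields one tuple per byte position, one byte per drive.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def split_block(block, data_drives):
    """Divide one logical block into equal segments, one per data drive."""
    if len(block) % data_drives != 0:
        raise ValueError("block length must divide evenly among the drives")
    size = len(block) // data_drives
    return [block[i * size:(i + 1) * size] for i in range(data_drives)]


def make_parity(segments):
    """Parity segment is simply the XOR of all the data segments."""
    return xor_blocks(segments)


def rebuild_missing(surviving_segments, parity):
    """XOR the survivors with the parity to recover the one lost segment."""
    return xor_blocks(list(surviving_segments) + [parity])


# The 16 bit words from the table above, one per data drive (hypothetical
# stand-ins for the 1k segments).
data = [
    (0b0101010101010101).to_bytes(2, "big"),  # drive 0
    (0b0011001100110011).to_bytes(2, "big"),  # drive 1
    (0b0000111100001111).to_bytes(2, "big"),  # drive 2
    (0b0000000011111111).to_bytes(2, "big"),  # drive 3
]
parity = make_parity(data)
print(format(int.from_bytes(parity, "big"), "016b"))  # prints 0110100110010110

# Pretend drive 2 died, then rebuild it from the other three plus parity.
rebuilt = rebuild_missing(data[:2] + data[3:], parity)
print(format(int.from_bytes(rebuilt, "big"), "016b"))  # prints 0000111100001111
```

The same two functions work unchanged for any number of data drives and any segment size, for example split_block(block, 4) on a 4k block followed by make_parity on the four 1k segments.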